A new method of quasi-optimal observables allows one to approach the quality of
data processing usually associated with the method of maximal likelihood within
the simpler algorithmic context of generalized moments.

In this lecture, I'd like to explain a recent finding [1] which connects the two basic
methods of parameter estimation, the method of maximal likelihood and the method of
generalized moments (see e.g. [2]). The two methods (along with the $\chi^2$ method, which I
won't discuss) are very well known and widely used in experimental physics.

In a sense, the connection views the method of maximal likelihood as corresponding to a 
special point in the space of generalized moments, and considers small deviations from that 
point. The point corresponds to the minimum of the fundamental Cramér-Rao inequality, and
small deviations from it introduce non-optimalities (compared with the maximal likelihood
method) that are only quadratic in the deviations. This approach offers what appears to be a 
new and useful algorithmic scheme which combines the theoretical advantage of the method 
of maximal likelihood (i.e. the fact that it yields the absolute minimum for the variance of the 
parameter being estimated with a given data sample) with the algorithmic simplicity of the 
method of moments. 

I call the resulting method the method of quasi-optimal observables. It is useful in 
situations where the method of maximal likelihood fails or cannot be applied, e.g. in high 
energy physics where typically only a Monte Carlo event generator is available but no explicit 
formula for the probability density.

One deals with a random variable $P$ whose instances (specific values) are called events.
Their probability density is denoted as $\pi(P)$. It is assumed to depend on a parameter $M$ which
has to be estimated from an experimental sample of events $\{P_i\}_i$.

The method of generalized moments consists in choosing a function $f(P)$ defined on
events (the generalized moment or, using the language of quantum theory, observable), and
then finding $M$ by fitting its theoretical average value,

$$\langle f\rangle = \int dP\,\pi(P)\,f(P), \tag{1}$$

against the corresponding experimental value:

$$\langle f\rangle_{\rm exp} = \frac{1}{N}\sum_i f(P_i). \tag{2}$$

The result of the fit is an estimate for $M$ denoted as $M[f]$. Once the observable $f$ is chosen,
the method is rather easy to use. However, the method says nothing about how to find a
good $f$, i.e. one which would minimize the variance $D(M[f])$ of the resulting estimate $M[f]$.
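
As a minimal sketch of the fit described by Eqs. (1) and (2), the Python snippet below assumes a toy exponential density $\pi(P) = e^{-P/M}/M$ and the observable $f(P) = P^2$, for which $\langle f\rangle = 2M^2$. The density, the observable and all helper names are illustrative assumptions and are not taken from [1].

```python
import random

def theoretical_mean_f(M):
    # <f> of Eq. (1) for the assumed toy density pi(P) = exp(-P/M)/M
    # and the assumed observable f(P) = P**2: the integral equals 2*M**2.
    return 2.0 * M * M

def fit_moment(events, f, mean_f_theory, lo, hi, tol=1e-10):
    """Solve <f>(M) = <f>_exp for M by bisection on [lo, hi].

    Assumes <f>(M) increases monotonically with M on the bracket.
    """
    f_exp = sum(f(p) for p in events) / len(events)  # Eq. (2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_f_theory(mid) < f_exp:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(1)
M_true = 2.0
# expovariate takes the rate 1/M, so the sample has mean M_true.
events = [rng.expovariate(1.0 / M_true) for _ in range(100_000)]
M_fit = fit_moment(events, lambda p: p * p, theoretical_mean_f, 0.1, 10.0)
print(M_fit)
```

A different choice of $f$ would give a different spread of `M_fit` around `M_true`, which is exactly the freedom the rest of the discussion exploits.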

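The method of maximal likelihood, on the other hand, prescribes choosing the $M$ which
maximizes the likelihood function

$$L = \prod_i \pi(P_i). \tag{3}$$

The necessary condition for the maximum then is

$$\frac{\partial \ln L}{\partial M} = \sum_i \frac{\partial \ln \pi(P_i)}{\partial M} = 0. \tag{4}$$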

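The method, if applicable, yields an estimate $M_{\rm opt}$ for $M$ whose variance $D(M_{\rm opt})$ is optimal
because it is asymptotically equal to the minimal value established by the fundamental
Cramér-Rao inequality (cf. Eq. (8.10) in [2]; $N$ is the number of events $P_i$):

$$N\,D(M_{\rm opt}) \ge \left\langle \left( \frac{\partial \ln \pi(P)}{\partial M} \right)^{2} \right\rangle^{-1}. \tag{5}$$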



Although theoretically ideal, the method of maximal likelihood may be difficult to make use 
of, e.g. if the number of events is large and/or there is no sufficiently simple regular 
expression for the probability density $\pi$. The worst case is, of course, when the explicit
expression for $\pi$ is unavailable; this case occurs when all one has is a Monte Carlo event
generator. 
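
When an explicit formula for the density is available, the likelihood condition can be solved numerically. The sketch below assumes the same kind of toy exponential density, $\pi(P) = e^{-P/M}/M$, whose logarithmic derivative with respect to $M$ is $-1/M + P/M^2$; the function names are hypothetical and the bisection relies on the summed derivative changing sign exactly once on the bracket.

```python
import random

def score(M, events):
    # Sum over events of d ln pi(P_i) / dM for the assumed density
    # pi(P) = exp(-P/M)/M, i.e. the left-hand side of the likelihood condition.
    return sum(-1.0 / M + p / (M * M) for p in events)

def solve_score(events, lo, hi, tol=1e-10):
    """Find the root of the score by bisection.

    For this toy density the score is positive below the maximum of the
    likelihood and negative above it.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, events) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(2)
events = [rng.expovariate(1.0 / 2.0) for _ in range(20_000)]
M_ml = solve_score(events, 0.1, 10.0)
sample_mean = sum(events) / len(events)
print(M_ml, sample_mean)  # for this density the two coincide
```

Every step of the search re-evaluates the density derivative on all events, which illustrates why the method becomes cumbersome for large samples and impossible when only a generator is available.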

So, on the one hand, there is a simple but non-optimal method of generalized moments. 
On the other hand, there is a theoretically ideal but cumbersome and often unusable method 
of maximal likelihood. And there is no apparent connection between the two. 

Following [1], let us ask a natural question: is it possible to find an observable $f$ which
would minimize $D(M[f])$? If such an observable $f_{\rm opt}$ exists, the corresponding $D(M[f_{\rm opt}])$ must
be directly connected to the r.h.s. of (5).

The trick used in [1] is as follows. Asymptotically,

$$N\,D(M[f]) \to \mathrm{Var}\,M[f] = \left( \frac{\partial \langle f\rangle}{\partial M} \right)^{-2} \mathrm{Var}\,f, \tag{6}$$

where $\mathrm{Var}\,f = \langle (f - \langle f\rangle)^2 \rangle$. Then it is sufficient to consider $\mathrm{Var}\,M[f]$ as a numeric function
in the functional space of $f$ and to use the apparatus of functional derivatives to study the
problem similarly to how one studies minima in ordinary spaces. (A note concerning
mathematical rigor: the method is valid under the same conditions as the method of maximal
likelihood, and the usual Hilbert norm of mathematical statistics is to be chosen in the space
of $f$.) The necessary condition for the minimum is

$$\frac{\delta\,\mathrm{Var}\,M[f]}{\delta f(P)} = 0. \tag{7}$$

After simple calculations (see [1] for details) one finds the following solution: 
$$f_{\rm opt}(P) = \frac{\partial \ln \pi(P)}{\partial M}. \tag{8}$$

(In fact, there is a family of solutions, $f(P) = C_1 f_{\rm opt}(P) + C_2$.) Another simple calculation
yields

$$\mathrm{Var}\,M[f_{\rm opt}] = \langle f_{\rm opt}^2 \rangle^{-1}. \tag{9}$$



In view of (8) and (5) we see that extracting M using the observable (8) is asymptotically 
equivalent to the method of maximal likelihood. 
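
The equivalence can also be checked without functional derivatives. The following sketch assumes that the asymptotic variance has the form $\mathrm{Var}\,M[f] = (\partial\langle f\rangle/\partial M)^{-2}\,\mathrm{Var}\,f$ and that $f$ itself does not depend on $M$; it is offered as a consistency check and is not necessarily the derivation given in [1].

```latex
\begin{align}
% Normalization of \pi implies that the observable of Eq. (8) has zero mean:
\langle f_{\rm opt} \rangle
  &= \int dP\, \frac{\partial \pi(P)}{\partial M}
   = \frac{\partial}{\partial M} \int dP\, \pi(P) = 0 , \\
% The M-derivative of the theoretical average is an overlap with f_opt:
\frac{\partial \langle f \rangle}{\partial M}
  &= \int dP\, \frac{\partial \pi(P)}{\partial M}\, f(P)
   = \langle f\, f_{\rm opt} \rangle
   = \langle \bar f\, f_{\rm opt} \rangle ,
   \qquad \bar f = f - \langle f \rangle , \\
% The Schwarz inequality then bounds the variance from below:
\mathrm{Var}\, M[f]
  &= \frac{\langle \bar f^{2} \rangle}{\langle \bar f\, f_{\rm opt} \rangle^{2}}
   \;\ge\; \frac{1}{\langle f_{\rm opt}^{2} \rangle} .
\end{align}
```

Equality holds when $\bar f$ is proportional to $f_{\rm opt}$, which reproduces both the family of solutions and the value of the minimal variance quoted above.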

Once we have adopted the viewpoint of analogy with ordinary functions, a natural next step is
to consider small deviations from $f_{\rm opt}$ and their effect on $\mathrm{Var}\,M[f]$. To this end, expand
$\mathrm{Var}\,M[f]$ in $f$ around $f_{\rm opt}$; what we are doing here is a functional analog of the Taylor
theorem:



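$$\mathrm{Var}\,M[f_{\rm opt}+\varphi] = \mathrm{Var}\,M[f_{\rm opt}] + \frac{1}{2}\int \left[ \frac{\delta^2\,\mathrm{Var}\,M[f]}{\delta f(P)\,\delta f(Q)} \right]_{f=f_{\rm opt}} \varphi(P)\,\varphi(Q)\,dP\,dQ + \ldots \tag{10}$$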



The term which is linear in $\varphi$ does not occur because $f_{\rm opt}$ satisfies (7). Explicit calculations
(see [1] for details) yield:

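$$\mathrm{Var}\,M[f_{\rm opt}+\varphi] = \frac{1}{\langle f_{\rm opt}^2\rangle} + \frac{1}{\langle f_{\rm opt}^2\rangle^{3}} \left\{ \langle f_{\rm opt}^2\rangle \times \langle \bar\varphi^2\rangle - \langle f_{\rm opt} \times \bar\varphi \rangle^{2} \right\} + \ldots \tag{11}$$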



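where $\bar\varphi = \varphi - \langle\varphi\rangle$. Non-negativity of the factor in curly braces follows from the standard
Schwarz inequality. From the viewpoint of (11), $\varphi$ is small if

$$\langle \bar\varphi^2 \rangle \ll \langle f_{\rm opt}^2 \rangle. \tag{12}$$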

Since the deviation from optimality is quadratic with respect to the deviation of observables
from $f_{\rm opt}$, one realizes that the exact knowledge of the probability distribution $\pi$ is not really
necessary: an approximation $f_{\rm quasi}$ to $f_{\rm opt}$ in the sense of (12) may be sufficient. Such an
approximation could be constructed even using a Monte Carlo generator. 
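
The quadratic dependence can be checked numerically. The sketch below assumes a toy Breit-Wigner density $\pi(x) \propto 1/(x^2+1)$ with $M = 0$ and unit width, for which the optimal observable is $2x/(1+x^2)$, and an arbitrarily chosen deviation $\varphi(x) = \varepsilon \tanh x$. It also assumes the identity $\partial\langle f\rangle/\partial M = \langle (f - \langle f\rangle)\, f_{\rm opt}\rangle$, which follows from the normalization of $\pi$. None of these choices comes from [1].

```python
import math
import random

def f_opt(x):
    # d ln pi / dM at M = 0 for the assumed density pi(x) ~ 1/(x^2 + 1)
    return 2.0 * x / (1.0 + x * x)

def var_M(f, events):
    """Sample estimate of Var M[f] = Var f / (d<f>/dM)^2,
    using d<f>/dM = <(f - <f>) * f_opt> (assumption stated above)."""
    n = len(events)
    fv = [f(x) for x in events]
    mean = sum(fv) / n
    var_f = sum((v - mean) ** 2 for v in fv) / n
    slope = sum((v - mean) * f_opt(x) for v, x in zip(fv, events)) / n
    return var_f / slope ** 2

rng = random.Random(3)
# Inverse-CDF sampling of the toy Breit-Wigner (Cauchy) density.
events = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(200_000)]

def perturbed(eps):
    # f_opt plus the deviation phi = eps * tanh(x); tanh is an arbitrary choice.
    return lambda x: f_opt(x) + eps * math.tanh(x)

v0 = var_M(f_opt, events)
excess_1 = var_M(perturbed(0.01), events) - v0
excess_2 = var_M(perturbed(0.02), events) - v0
ratio = excess_2 / excess_1  # near 4 if the loss is quadratic in eps
print(v0, excess_1, excess_2, ratio)
```

Doubling the size of the deviation roughly quadruples the excess variance, while the excess itself stays far below the optimal value `v0`.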
There are several interesting points about this method. 

• The usual procedures of imposing cuts on events to enhance the signal/background ratio fully
agree with the above prescriptions. Indeed, suppose $\pi = \pi_{\rm bg} + \pi_{\rm signal}$, where only the signal
contribution depends on $M$. Then

$$f_{\rm opt} = \frac{\partial_M \pi_{\rm signal}}{\pi_{\rm bg} + \pi_{\rm signal}}. \tag{13}$$

This vanishes where the background is large compared with the signal.
• The optimal observable is localized on events where $\pi$ exhibits the largest variation with
respect to the parameter being studied, not where $\pi$ is largest. In addition, such observables
may have different signs in different regions of phase space, e.g. in the case of parameters such as
masses. Indeed, for $\pi(P) \propto \dfrac{1}{(M-P)^2+\Gamma^2}$ the optimal observable with respect to $M$ has the
form $f_{M,{\rm opt}}(P) \sim \dfrac{-(M-P)}{(M-P)^2+\Gamma^2}$. Then one has an array of simple shapes to choose from in
construction of quasi-optimal observables as shown below:

[Figure (14), panels (a)-(d): sketches of $\pi(P)$, $f_{\rm opt}(P)$ and $f_{\rm quasi}(P)$ as functions of $P$ around $M$.]

• In the above example, there is another parameter, $\Gamma$. It is straightforward to define an optimal
observable for this parameter too. In general, with several parameters to be estimated, there is an 
optimal observable per parameter. Error ellipsoids are constructed in the usual fashion. 

• A theoretical prediction for $\pi$ may involve a low order result and some higher order
corrections. In some cases such corrections will only marginally affect $f_{\rm quasi}$, so one could
construct $f_{\rm quasi}$ using the simplest expression for $\pi_{\rm theor}$. However, this affects only the construction
of $f_{\rm quasi}$: once the latter is fixed, the extraction of $M$ from data must involve $\langle f_{\rm quasi}\rangle$ computed by
numerical integration of the $f_{\rm quasi}$ thus fixed against the theoretical probability distribution with all
the corrections taken into account.

• From the algorithmic viewpoint, the problem of numerical construction of a quasi-optimal 
observable from a MC generator is sister to the problem of MC integration. There is a 
considerable array of options here (cf. [3]), and given the described firm analytical foundation of 
the method, I'd expect it to eventually become a tool of choice in many situations where at 
present less focused methods are used, such as those based on neural networks.
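
One possible way to build a quasi-optimal observable from generator output alone is sketched below: events are generated at two nearby parameter values, histogrammed, and the relative difference of bin contents is used as a piecewise-constant estimate of $\partial \ln \pi / \partial M$. The toy generator (a Breit-Wigner with known width), the binning, the step `h` and all function names are assumptions made for illustration. The analytic `f_opt_exact` is used only to grade the result and would not be available in a real application. The grading uses the squared correlation of the observable with $f_{\rm opt}$, which equals the ratio of the minimal variance to $\mathrm{Var}\,M[f]$ if $\partial\langle f\rangle/\partial M = \langle (f-\langle f\rangle) f_{\rm opt}\rangle$.

```python
import math
import random

GAMMA = 1.0  # width of the toy Breit-Wigner, assumed known

def generate(M, n, seed):
    """Toy stand-in for an MC event generator (Breit-Wigner events)."""
    rng = random.Random(seed)
    return [M + GAMMA * math.tan(math.pi * (rng.random() - 0.5))
            for _ in range(n)]

def build_f_quasi(M0, h, lo, hi, nbins, n, seed):
    """Histogram estimate of d ln pi / dM at M0 from generator output only."""
    width = (hi - lo) / nbins
    plus, minus = [0] * nbins, [0] * nbins
    # The same seed is used for both runs (common random numbers),
    # which reduces the statistical noise of the difference.
    for counts, M in ((plus, M0 + h), (minus, M0 - h)):
        for p in generate(M, n, seed):
            k = math.floor((p - lo) / width)
            if 0 <= k < nbins:
                counts[k] += 1
    # Central difference (a - b)/(2h) divided by the mean content (a + b)/2.
    table = [(a - b) / (h * (a + b)) if a + b > 0 else 0.0
             for a, b in zip(plus, minus)]

    def f_quasi(p):
        k = math.floor((p - lo) / width)
        return table[k] if 0 <= k < nbins else 0.0  # zero outside the histogram

    return f_quasi

def f_opt_exact(p, M0=0.0):
    # Analytic d ln pi / dM for the toy density; used for grading only.
    return 2.0 * (p - M0) / ((p - M0) ** 2 + GAMMA ** 2)

def efficiency(f, g, events):
    """Squared sample correlation between observables f and g."""
    n = len(events)
    fv = [f(p) for p in events]
    gv = [g(p) for p in events]
    fm, gm = sum(fv) / n, sum(gv) / n
    cff = sum((a - fm) ** 2 for a in fv) / n
    cgg = sum((b - gm) ** 2 for b in gv) / n
    cfg = sum((a - fm) * (b - gm) for a, b in zip(fv, gv)) / n
    return cfg * cfg / (cff * cgg)

f_quasi = build_f_quasi(M0=0.0, h=0.2, lo=-8.0, hi=8.0, nbins=40,
                        n=200_000, seed=5)
test_events = generate(0.0, 100_000, seed=11)  # independent sample
eff = efficiency(f_quasi, f_opt_exact, test_events)
print(eff)
```

An efficiency close to one means that the crude histogram observable loses little compared with maximal likelihood; finer binning or smoother interpolation are among the many options mentioned above.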

To summarize, parameter estimation via quasi-optimal observables combines, within a
flexible algorithmic scheme, the optimality of maximal likelihood with the simplicity of
generalized moments.

The method of quasi-optimal observables may be useful in experimental situations 
characterized by: 

⇒ high precision requirements and/or low signal;

⇒ many events to be processed and/or the signal not localized sufficiently well for cuts to work;

⇒ a complicated underlying theory (absence of an explicit formula for the probability distribution $\pi$,
complicated higher order corrections, singular theoretical predictions for $\pi$).

Finally, the author is rather uncomfortable with the claim to have discovered a new 
algorithmic scheme for parameter estimation based on such a simple connection between the 
two venerable methods — the methods of generalized moments and maximal likelihood — 
both learnt by O(10000) students worldwide for about half a century. However, I checked a
large number of textbooks and monographs on mathematical statistics and its applications
and failed to find any trace of it being known to the experts. Also, there is indirect
evidence: it is safe to say that all known methods of parameter estimation are used in high 
energy physics one way or another (cf. [4]), and although attempts to construct better 
observables using e.g. neural networks abound in high energy physics (cf. [4]), there seems 
to be no trace of the notion of (quasi-) optimal observables being known to high energy 
physicists. This is utterly puzzling as the connection is so simple. So, if the claim of novelty is
correct, the inevitable question is: why was the connection not discovered sooner?

The only explanations I can offer involve history and psychology. Indeed, the geometrical 
viewpoint of functional analysis was not widespread at the time of discovery of the methods
of maximal likelihood and generalized moments, and programmers and calculationists still
have little working knowledge of it. On top of that, neither students nor researchers feel the 
perfunctory proofs of elementary textbook results deserve more than a cursory glance: 
students have so much to learn; mathematicians, so much to prove; data processing experts, 
so much code to debug. In short, no one can afford to indulge in dwelling upon elementary 
results when there is so much hard work to be done to earn one's living. In the case of [1], 
the pattern was broken by an unconventional motivation from the theory of jet observables 
developed in [5]; the theory ran contrary to some prevailing prejudices, and as is usual in 
such cases, the author was under pressure to seek all sorts of arguments to fortify it, which 
led to a foray into the domain of mathematical statistics. Actually, the solution of the old 
problem of finding optimal jet-finding algorithms described in [5] is per se a sufficient proof 
(if such were needed) of usefulness of the concept of quasi-optimal observables. 

I thank M. Kienzle-Focacci and P. Bhat for their interest. This work was supported in
part by the RFBR grant 99-02-18365 and the NATO grant PST.CLG.977751.